Blaeu: Mapping and navigating large tables with cluster analysis
Blaeu is an interactive database exploration tool. Its aim is to guide casual users through large data tables, ultimately triggering insights and serendipity. To do so, it relies on a double cluster analysis mechanism. It clusters the data vertically: it detects themes, groups of mutually dependent columns that highlight one aspect of the data. Then it clusters the data horizontally. For each theme, it produces a data map, an interactive visualization of the clusters in the table. The data maps summarize the data. They provide a visual synopsis of the clusters, as well as facilities to inspect their content and annotate them. But they also let the users navigate further. Our explorers can change the active set of columns or drill down into the clusters to refine their selection. Our prototype is fully operational, ready to deliver insights from complex databases.
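The double cluster analysis described above can be illustrated with a minimal sketch. It is not Blaeu's implementation: the dependency measure (absolute Pearson correlation), the threshold, the tiny k-means, and the names find_themes, kmeans and data_maps are all assumptions chosen for illustration. Columns are first grouped into themes, then the rows are clustered once per theme.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def find_themes(rows, threshold=0.8):
    """Vertical step (assumed): connected components of columns with |r| >= threshold."""
    cols = list(zip(*rows))
    unseen = set(range(len(cols)))
    themes = []
    while unseen:
        start = min(unseen)
        unseen.remove(start)
        theme, stack = [start], [start]
        while stack:
            i = stack.pop()
            linked = [j for j in unseen
                      if abs(pearson(cols[i], cols[j])) >= threshold]
            for j in linked:
                unseen.remove(j)
                theme.append(j)
                stack.append(j)
        themes.append(sorted(theme))
    return themes

def kmeans(points, k=2, iterations=10):
    """Horizontal step (assumed): tiny k-means seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def data_maps(rows, threshold=0.8, k=2):
    """For each theme, return (column indices, row cluster labels)."""
    maps = []
    for theme in find_themes(rows, threshold):
        projected = [[row[j] for j in theme] for row in rows]
        maps.append((theme, kmeans(projected, k)))
    return maps

# Toy table: columns 0 and 1 move together, as do columns 2 and 3.
table = [
    [1.0, 2.0, 9.0, 90.0],
    [2.0, 4.1, 1.0, 10.0],
    [8.0, 16.0, 8.0, 80.0],
    [9.0, 18.2, 2.0, 20.0],
]
for theme, labels in data_maps(table):
    print(theme, labels)
```

On this toy table the same four rows are grouped differently depending on the theme, which is the kind of per-theme view a data map is meant to expose.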
Genome sequence analysis with MonetDB: a case study on Ebola virus diversity
Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and cost, but results in terabytes of data to be stored and analysed. Biologists are often exposed to excessively time-consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.
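To give a feel for what DBMS-based analysis of alignment data looks like, here is a minimal sketch. It uses SQLite from the Python standard library as a stand-in for MonetDB, and the alignments table, its rows, and the reference names are hypothetical; this is not the MonetDB/BAM schema or API. Only the field meanings (FLAG bit 0x4 for unmapped reads, 1-based POS, MAPQ) follow the SAM format.

```python
import sqlite3

# Hypothetical, heavily simplified alignment table; NOT the MonetDB/BAM schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE alignments (
        qname TEXT,     -- read name
        flag  INTEGER,  -- SAM bitwise flag
        rname TEXT,     -- reference sequence name
        pos   INTEGER,  -- 1-based leftmost mapping position
        mapq  INTEGER   -- mapping quality
    )
""")
reads = [
    ("read1", 0, "ref_A", 100, 60),
    ("read2", 16, "ref_A", 150, 12),  # flag bit 0x10: reverse strand
    ("read3", 4, "*", 0, 0),          # flag bit 0x4: read is unmapped
    ("read4", 0, "ref_B", 20, 37),
]
conn.executemany("INSERT INTO alignments VALUES (?, ?, ?, ?, ?)", reads)

def mapped_reads_per_reference(conn, min_mapq=0):
    """Count mapped reads (flag bit 0x4 unset) per reference, filtered by MAPQ."""
    query = """
        SELECT rname, COUNT(*) AS n_reads, AVG(mapq) AS mean_mapq
        FROM alignments
        WHERE (flag & 4) = 0 AND mapq >= ?
        GROUP BY rname
        ORDER BY rname
    """
    return conn.execute(query, (min_mapq,)).fetchall()

print(mapped_reads_per_reference(conn))
print(mapped_reads_per_reference(conn, min_mapq=30))
```

The point of the sketch is that a filter-and-aggregate question becomes one declarative query instead of a custom pass over the files.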
Genome sequence analysis with MonetDB - A case study on Ebola virus diversity
Next-generation sequencing (NGS) technology has led the life sciences into the big data era. Today, sequencing genomes takes little time and cost, but yields terabytes of data to be stored and analyzed. Biologists are often exposed to excessively time consuming and error-prone data management and analysis hurdles. In this paper, we propose a database management system (DBMS) based approach to accelerate and substantially simplify genome sequence analysis. We have extended MonetDB, an open-source column-based DBMS, with a BAM module, which enables easy, flexible, and rapid management and analysis of sequence alignment data stored as Sequence Alignment/Map (SAM/BAM) files. We describe the main features of MonetDB/BAM using a case study on Ebola virus genomes.
Computational pan-genomics: status, promises and challenges
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.
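The shift from strings to graphs mentioned above can be sketched with one common graph representation, a colored de Bruijn graph. This is an illustrative choice, not a method prescribed by the review; the function names and the two toy sequences g1 and g2 are hypothetical. Each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and each edge records which genomes contain it.

```python
from collections import defaultdict

def build_colored_de_bruijn(genomes, k=3):
    """Map each edge (prefix, suffix) of a k-mer to the set of genomes containing it."""
    edges = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(name)
    return dict(edges)

def core_and_private(edges, n_genomes):
    """Split edges into those shared by all genomes and those unique to one genome."""
    core = {e for e, owners in edges.items() if len(owners) == n_genomes}
    private = {e: next(iter(owners))
               for e, owners in edges.items() if len(owners) == 1}
    return core, private

# Two toy sequences that differ by a single substitution (T versus G).
genomes = {"g1": "ACGTAC", "g2": "ACGGAC"}
edges = build_colored_de_bruijn(genomes, k=3)
core, private = core_and_private(edges, len(genomes))
print(sorted(core))
print(sorted(private.items()))
```

In this toy graph the substitution appears as a branch after node CG that rejoins at node AC, so both genomes are held in one structure rather than as two separate strings.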
Computational pan-genomics: Status, promises and challenges
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations
Bridging the gap between Big Genome Data Analysis and Database Management Systems
The bioinformatics field has encountered a data deluge in recent years, due to the increasing speed and decreasing cost of DNA sequencing technology. Today, sequencing the DNA of a single genome only takes about a week, and it can result in up to a terabyte of data. The sequencing data are usually stored in files, and specialized tools have been designed to analyze and manage them. Despite these tools, bioinformaticians are still exposed to many data management hurdles when analyzing these files, which often leads to excessively time-consuming tasks.
In this thesis, we accurately map the needs of bioinformaticians by defining a set of use cases that reflect the everyday analysis that is applied to genetic data. We propose an approach based on a modern DBMS to analyze and manage genetic data file repositories. We identify the pros and cons of this method compared to the traditional file-based approach.
Additionally, we experimented with a novel in-situ approach, where the DBMS applies Just-In-Time ETL (Extract-Transform-Load) on the original files instead of loading all data from these files up front. A major advantage of this approach is that it greatly reduces the data-to-query time, since not all data are loaded in the DBMS during initialization. Other advantages include lower storage requirements and less data duplication.
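The just-in-time idea can be sketched as follows, under loud assumptions: SQLite stands in for the DBMS, small CSV files stand in for sequencing files, and the LazyRepository class with its methods is hypothetical, not the thesis implementation. Files are only registered at initialization; a file's content is extracted and loaded the first time a query touches it.

```python
import csv
import os
import sqlite3
import tempfile

class LazyRepository:
    """Registers files up front but loads each one only when a query first needs it."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE records (file TEXT, name TEXT, value INTEGER)")
        self.paths = {}      # file id -> path on disk (registered, not loaded)
        self.loaded = set()  # file ids whose rows are already in the database

    def register(self, file_id, path):
        self.paths[file_id] = path  # cheap: no file content is read here

    def _ensure_loaded(self, file_id):
        if file_id in self.loaded:
            return
        with open(self.paths[file_id], newline="") as handle:
            rows = [(file_id, name, int(value))
                    for name, value in csv.reader(handle)]
        self.conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
        self.loaded.add(file_id)

    def total(self, file_id):
        """Query one file; extraction and loading happen here, just in time."""
        self._ensure_loaded(file_id)
        cur = self.conn.execute(
            "SELECT SUM(value) FROM records WHERE file = ?", (file_id,))
        return cur.fetchone()[0]

# Demo with two small temporary CSV files standing in for sequencing files.
tmpdir = tempfile.mkdtemp()
repo = LazyRepository()
for file_id, rows in {"a": [("x", 1), ("y", 2)], "b": [("z", 10)]}.items():
    path = os.path.join(tmpdir, file_id + ".csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    repo.register(file_id, path)

print(repo.loaded)      # empty: registration read no file content
print(repo.total("a"))  # loads file "a" on demand, then queries it
print(repo.loaded)      # only "a" has been loaded so far
```

Because file "b" is never queried in the demo, it is never read, which is where the shorter data-to-query time and the lower storage use come from.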
With this project, we have taken the first step towards adapting state-of-the-art database technology to accelerate genetic data analytics. The preliminary results presented in this thesis are highly promising and they open up a plethora of new research opportunities.
A case study on Ebola virus diversity
Cijvat R, Manegold S, Kersten M, et al. Genome sequence analysis with MonetDB. Datenbank-Spektrum. 2015;15(3):185-191